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Abstract 

Background: As the magnitude of the experiment increases, it is common to combine various types of 
microarrays such as paired and non-paired microarrays from different laboratories or hospitals. Thus, it is important 
to analyze microarray data together to derive a combined conclusion after accounting for heterogeneity among 
data sets. One of the main objectives of the microarray experiment is to identify differentially expressed genes 
among the different experimental groups. We propose the linear mixed effect model for the integrated analysis of 
the heterogeneous microarray data sets. 

Results: The proposed linear mixed effect model was illustrated using the data from 133 microarrays collected at 
three different hospitals. Though simulation studies, we compared the proposed linear mixed effect model 
approach with the meta-analysis and the ANOVA model approaches. The linear mixed effect model approach was 
shown to provide higher powers than the other approaches. 

Conclusions: The linear mixed effect model has advantages of allowing for various types of covariance structures 
over ANOVA model. Further, it can handle easily the correlated microarray data such as paired microarray data and 
repeated microarray data from the same subject. 



Background 

Microarray technology has important applications in 
pharmaceutical and clinical research. For example, 
microarrays can be used to identify tumor-related genes 
and targets for therapeutic drugs. In microarray experi- 
ments, the identification of differentially expressed genes 
(DEG) is an important issue. Statistical test procedures 
have served as useful tools for identifying the DEGs 
which can be candidate genes for a specific disease or 
can be used for the further analysis such as clustering 
analysis and gene regulatory network construction. 

As the cost of producing microarrays has become 
lower costs and the importance of replication in micro- 
array experiments has been demonstrated by many 
researchers [1], replicated microarrays are commonly 
used in microarray experiments. In order to handle 
replicated microarrays, many statistical test procedures 
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have been developed, such as i-statistics, to identify 
DEGs between two groups [2] . The analysis of variance 
(ANOVA) model approach was proposed to identify 
DEGs among multiple groups [3]. In addition, many sta- 
tistical models have been proposed to identify the DEGs 
on replicated microarrays [4-11]. 

When the magnitude of a microarray experiment 
increases, it is common to use the same type of micro- 
arrays from different laboratories or hospitals. Thus, it is 
important to analyze microarray data together to derive 
a combined conclusion after accounting for the differ- 
ences. Recently, statistical approaches based on meta- 
analysis have been proposed in order to combine inde- 
pendent and heterogeneous microarray studies [12-15]. 
In these approaches, microarrays were classified into 
several independent groups and integration methods to 
analyze microarray data sets from different laboratories 
were proposed. The key idea of meta-analysis is to com- 
bine summary statistics from each study in which signif- 
icant levels (p-values) and effect sizes are commonly 
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used as summary statistics. Meta-analysis requires data 
be homogeneous within the data set. When there are 
microarray-specific covariates such as gender and smok- 
ing status, meta-analysis can be less effective. 

Shen et al. (2004) introduced the probability of 
expression (POE) and proposed a method to estimate 
the POE using MCMC [16]. The POE is the scale-free 
measure transformed from raw gene expression defined 
by the difference between probabilities of over- and 
under-expressed gene expression. Using the POE, the 
gene expressions of heterogenous microarray experi- 
ments can be uniquely scaled from -1 and 1 and com- 
bined easily. Choi et al. (2007) proposed EM algorithm 
to estimate the POE instead of MCMC, which can 
reduce the estimation time of the POE [17]. Standar- 
dized POE can combine multiple microarray data sets, 
however, the POE method can be more efficient when 
the microarray-specific covariates are applied. 

Park et al. [18] proposed a two-stage ANOVA model 
approach for the integrated analysis, which uses the 
ANOVA model with controlling variables for additional 
variability of heterogeneous microarray studies. The 
usual ANOVA model was extended to account for an 
additional variability resulting from many confounding 
variables. When variability among data sets is relatively 
small, the ANOVA model is effective. Otherwise, the 
ANOVA model is not recommended. Further, when the 
microarrays are correlated, the ANOVA model cannot 
handle such correlation appropriately, because it 
requires the independence of samples. Therefore, corre- 
lated microarray data can violate the assumption of the 
ANOVA model and thus the extended model to allow 
for various types of covariance structure of errors is 
needed. 

In this paper, we propose the linear mixed effect 
(LMe) model for the integrated analysis of the heteroge- 
neous microarray data sets. The LMe model contains 
various random effects which effectively account for the 
heterogeneous variability in the data from many differ- 
ent sources. Further, the LMe model has advantages of 
allowing for various types of covariance structures over 
meta-analysis and ANOVA model approaches. Thus, it 
can handle easily the correlated microarray data such as 
paired and non-paired microarray data. The proposed 
method is illustrated using the liver cancer microarray 
data sets obtained from three different hospitals [14]. 

Materials and methods 

Four independent microarray data sets were generated 
from three hospitals using two different chips [15]. The 
first chip, Ci, contains 10,336 human cDNA probes that 
were verified by single pass sequencing. The second 
chip, C 2 , contains 10,368 human cDNA probes. Two 
chips shared the common 9,984 cDNA probes. The 



chips were cDNA chips with two-colors, where the way 
of labeling samples and controls is described in Choi et 
al. (2004). A further detailed description of the chips 
has been uploaded to the Gene Expression Omnibus 
(GEO) site (http://www.ncbi.nlm.nih.gov/geo/) with 
GEO accession number GPL2911. 

The chip type (1 and 2), labeling scheme, hospital and 
number of samples are shown in this table. Here, the 
data were normalized by locally weighted scatterplot 
smoothing (LOWESS; Cleveland, 1979). For LOWESS 
normalization, the value of the span parameter was 0.75 
and the tricubic function was used as a weight function. 
For robustness analysis, Tukey's biweight function was 
used [18]. Hepatocellular carcinoma (HCC) and adjacent 
control (normal) samples were obtained with informed 
consent from patients at three hospitals. All the HCC 
samples were hepatitis B virus (HBV) positive. Sample 
preparation, microarray hybridizations, and fluorescence 
signal acquisitions were carried out independently at 
each institution according to similar but not identical 
experimental protocols and laboratory conditions. 

Table 1 summarizes the descriptive information for 
the microarray experiment. Thirty two microarrays were 
produced from 17 patients in Hospital A yielding data 
set Dl. A pair of microarrays for each patient were pro- 
duced from two cDNA samples: one from the HCC 
sample and the other from the control(normal) sample. 
In Hospital A, fifteen pairs of microarrays and two indi- 
vidual microarrays were produced. Forty six microarrays 
were produced from 23 patients in Hospital B yielding 
data set D2. Fifty five microarrays were produced from 
43 patients in Hospital C. Only twelve pairs of microar- 
rays and 37 individual microarrays were produced. Chip 
C 2 was used only on 21 microarrays from 13 patients in 
Hospital C. Other microarrays were produced by Chip 
C\. Microarrays from Hospital C were divided into two 
data sets (D3 and D4) according to the chip type. All 
microarray data were obtained using the reference 
design with the placenta as the reference. 

The LMe models 

Suppose there are H multiple data sets denoted by h = 
1, H. There are n h patients for the /zth data set. In 
our study, H = 4 and treatment groups consist of two 
levels denoted by k = T, C, where one (k = T) is the 
tumor tissue group and the other (k = C) is the control 
tissue group. For the paired observations, k has two 
values T and C. For the non-paired observation, k has 
only one value of T or C. Assume there are N common 
probes on each chip for all data sets. We denote genes 
by / (= 1,..., A/). The linear mixed effects (LMe) model 
consists of both fixed effects and random effects. The 
LMe model for the Zth gene is given by 

Yhil = X h ifli + Z hi[ b hi[ + Shil, 



Yi and Park BMC Bioinformatics 201 1, 12(Suppl 5):S3 
http://www.biomedcentral.eom/1 471 -2 1 05/1 2/S5/S3 



Page 3 of 8 



Table 1 Descriptive information for the liver cancer microarray data 


Data set ID 


Hospital 


Chip type 


Number of paired samples 


Number of non-paired samples 


Total number of samples 


tumor 


control 


tumor control 




D1 


A 


c, 


15 


15 


1 1 


32 


D2 


B 


c, 


23 


23 


0 0 


46 


D3 


C 


Ci 


4 


<1 


25 1 


34 


D4 


C 


c 2 


8 


8 


4 1 


21 



h = l, H, i = 1, n h , I = 1, N, (1) 

where Y hil is a response vector for the z'th subject 
(patient) of the hth data set, /J/ is the fixed effect para- 
meter vector, bhu is the random effect parameter vector, 
and £/,,/ is the error vector. Random effects and errors are 
assumed to be independent and normally distributed: 

b M ~ MO, *,), sui ~ MO, la 2 ). (2) 

The variance of random effects <!>/, can have several 
forms. When the off-diagonal terms are zero, then the 
random effects are uncorrelated. Otherwise, they are 
correlated. By allowing different forms of we can 
model variability among samples efficiently. When there 
are no random effects, say "L^u = 0, the LMe models 
become equivalent to the ANOVA models. 

For the liver cancer data, there are three fixed effects: 
treatment, hospital, and chip type. The LMe model is 
given by following equation: 

y MH = Pol + Pn x Wi + Pa x hM + PHJ x hM + Pn 2 l x hiM + ^hikl + B hM 

T fl if fe = T c Jl if Chip Cj 
* m ~[o iffe = C' *"*> ifChi P C 2 ' (3) 

H J 1 if Hospital A H Jl if Hospital B 
*™ = {o otherwise ' ^"{o otherwise ' 

where / = 1,..., 9984, h = I,..., 4, firi represents the 
treatment effect of differences between tumor tissue 
and control tissue, f5 C i represents the effect of differ- 
ences between two chips, and two parameters, (i H l 
and Ph 2 1> represent the effect of differences among 
hospitals. 

Types of covariance structure 

The most general form of covariance matrix in the LMe 
models assumes the covariance matrix of gene expres- 
sions within each data set is unstructured and differs 
among data sets. However, this covariance matrix 
requires many parameters to be estimated, which could 
result in a possible loss of power. Therefore, we need to 
consider simplified forms of the covariance matrices of 
t>M7- We consider four types of covariance forms for the 
integrated analysis of microarray data. For simplicity, we 
start with the case when the data consist of all paired 
observations. 



Paired microarrays 

1. Type 1: General form Covariance matrix of h hi (. 
Var(b i ) = Diag(/ 1 ,/ 2 / „) 



2 


„2 


fill 


°M2 


2 


„2 


hl2 


a h22 



2. Type 2: One common unstructured covariance 
matrix for all data sets Covariance matrix of b/,,;: 



Var(b M ) = / = 



2 


„2 


11 


<J l2 


2 


^2 


12 


a 22 



3. Type 3: Compound symmetry covariance matrix 
with different variance parameter for each data set b^u 
has only one component bhu and its variance is given by 

Var(b M ) = al, h = l H 

4. Type 4: One common compound symmetry covar- 
iance matrix with the same variance parameter for all 
data sets bhu has only one component bhu and its var- 
iance is given by 

Var(b M ) = <7 C 2 

Type 2 assumes the covariance matrix of gene expres- 
sions within each data set is unstructured like Type 1 
but it is the same over the data sets, which is a simpli- 
fied form of Type 1. Type 3 assumes each covariance 
matrix within the data set is compound symmetric and 
differs over the data sets. Type 4 is simplified version of 
Type 3 assuming the same covariance matrix over the 
data sets. 

For all types of covariance structure, the variance of 
Y m is given by 

Y M = V«r(b hu ) + la 2 . 
Non-paired microarrays 

For the non-paired observation, the dimension of Y hi i 
becomes one. The LMe allows only a scalar random 
effect. That is, b^a has only one component b^M- For 
example, Type 1 assumes 
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[<j h22 if 



Tests 

LMe model parameters can be estimated via maximum 
likelihood estimation. The DEGs can be identified by 
testing whether fin = 0 or not. LMe models also suffer 
from the multiple testing problem. We apply the FDR 
adjustment method proposed by Benjamini et al.[19]. 

Results 

Analysis of the liver cancer microarray data 

We applied the integrated analysis using LMe models, 
two-stage ANOVA model, and meta-analysis to liver 
cancer data. The LMe model is given in Equation 3. We 
fit this LMe model by assuming that b^a has the covar- 
iance structure of Types 1 to 4. These four models are 
denoted by Ml, M2, M3, and M4, respectively. The last 
LMe model M5 assumes no random effects and is 
expected to provide similar results to the two-stage 
ANOVA model. 

Table 2 shows the number of DEGs for the given 
FDRs. A much larger number of genes were identified 
by integrated analyses. When FDR is controlled by 5%, 
the number of DEGs on each individual microarray data 
set Dl to D4 are only 0, 38, 0, and 0, respectively [18]. 
However, the number of identified genes by meta-analy- 
sis, ANOVA model, five LMe models are 197, 145, 214, 
543, 589, 375, and 114, respectively. The number of sig- 
nificant genes varied across methods. The smallest num- 
ber of genes was selected by the LMe model M5 which 
requires the independent assumption between the paired 
microarrays. A similar number of genes was selected in 
meta-analysis and ANOVA model using the permuta- 
tion tests. Much larger number of genes were selected 
by LMe models Ml, M2, M3, and M4. The largest num- 
ber of genes was identified by M3 implying that M3 is 
more powerful method than other LMe models. How- 
ever, the results may contain false positive errors. In 



Table 2 Genes that are identified as differentially 
expressed when FDR is controlled 1%, 5%, 10%, and 
20%, respectively 



FDR 


Meta 
analysis 


Two-stage 
ANOVA 






LMe 












M1 


M2 


M3 


M4 


M5 


1% 


57 


46 


119 


184 


205 


124 


37 


5% 


197 


145 


214 


543 


589 


375 


114 


10 % 


303 


203 


339 


879 


978 


740 


181 


20 % 


478 


336 


585 


1500 


1761 


1323 


342 



Subsection Simulation Study, we investigate this issue 
through a simulation study and show that M3 controls 
Type I error rate well. 

Now, we focus on comparing the results. For simpli- 
city, we consider LMe model M3 only for summarizing 
the LMe models. Figure 1 shows the number of the sig- 
nificant genes identified by meta-analysis, two-stage 
ANOVA model, and LMe model M3 when FDR is 1%. 

The number of common genes selected by all three 
methods was only 16. Among them 9 genes are known 
and Table 3 summarizes their characteristics. The num- 
ber of common genes selected by two-stage ANOVA 
model and meta-analysis was 22. On the other hand, the 
numbers of the common genes selected by LMe with 
others are small. 

The number of genes identified only by M3 was 183 in 
the Figure 1 Some genes have been found to be related 
with liver disorders (BChE, C6, C9, CAP2, CDKN2A, 
CtBP, Cul4A, Gabl, Idl, NTRK1, PSG1, and PSMG). 
HChE was shown to exhibit highly elevated aryl acylami- 
dase activity (AAA). The absolute levels of AAA were 
increased as BChE activity decreased while deviating 
from normal samples and such deviation was directly 
proportional to the severity of the liver disorder [20]. 

C6 is a component of the complement system, which 
plays an important role as a humoral effector system 
during inflammation and infection, and consists of more 
than 25 components, including regulatory proteins. C6 
was shown to late-acting complement proteins that par- 
ticipate in the assembly of the membrane attack com- 
plex, which causes cell lysis by the formation of pores in 
the cell membrane of certain microorganisms. [21]. C9 
was related to the medication of tumor PDT by photo- 
sensitizer Photofrin using mouse Lewis lung carcinoma 
(LLC) model [22]. Cyclase-associated protein 2 (CAP2) 
was listed as an up-regulated gene in early hepatocellu- 
lar carcinoma (HCC) [23]. CDKN2A was reported to be 
differentially regulated by methylation between normal 




Figure 1 Number of genes that are identified by meta-analysis, 
two-stage ANOVA, and LMe model M3 when FDR is controlled 
by 1% 
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Table 3 Common genes detected by meta-analysis, two-stage ANOVA model, and LMe M3 model when FDR is 
controlled by 1% (9 known genes) 



Unigene ID Description 



Hs.82084 


Integrin beta 3 binding protein (beta3-endonexin) (ITGB3BP), mRNA 


Hs.514 


Cyclin H (CCNH), mRNA 


Hs.167529 


Cytochrome P450, subfamily IIC (mephenytoin 4-hydroxylase), polypeptide 9 (CYP2C9), mRNA 


Hs.1 17367 


Solute carrier family 22 (organic cation transporter), member 1 (SLC22A1), mRNA 


Hs.54900 


Serologically defined colon cancer antigen 1 (SDCCAG1), mRNA 


Hs.80756 


Betaine-homocysteine methyltransferase (BHMT), mRNA 


Hs.8765 


RNA helicase-related protein (RNAHP), mRNA 


Hs.755990 


Haptoglobin (HP), mRNA 


Hs.35101 


Proline-rich Gla (G-carboxyglutamic acid) polypeptide 2 (PRRG2), mRNA 



tissue and HCC. Low levels of methylation in normal 
tissue and adjacent tissue but high levels in HCC [24]. 
C-terminal binding protein (CtBP) was reported to 
relate with INK4A/ARF tumor suppressor gene. The 
INK4A/ARF tumor suppressor locus is frequently inacti- 
vated in HCC. Inhibition of cell invasion by pl9Arf was 
dependent on its C-terminal binding protein (CtBP) 
[25]. The Cul4A gene is amplified in human breast and 
liver cancers, and loss-of-function of Cul4 results in the 
accumulation of the replication licensing factor CDT1 in 
Caenorhabditis elegans embryos and ultraviolet (UV)- 
irradiated human cells [26]. 

Gabl was reported to be related with hepatic insulin 
action. Deletion of Gabl in the liver leads to enhanced 
glucose tolerance and improved hepatic insulin action. 
It was also shown that association of Gabl adaptor pro- 
tein and Shp2 tyrosine phosphatase is a critical event at 
the early phase of liver regeneration [27,28]. Idl was 
identified as TGF-/J/ALK1/Smadl target gene in HSCs 
and represents a critical mediator of transdifferentiation 
that might be involved in hepatic fibrogenesis. Trans- 
forming growth factor (TGF)-/3 is critically involved in 
the activation of hepatic stellate cells (HSCs) that occurs 
during the process of liver damage, for example, by alco- 
hol, hepatotoxic viruses, or aflatoxins [29,30]. NTRK1 
was reported to be a favorable neuroblastoma (NB) 
genes. NB is a common pediatric solid tumor that exhi- 
bits a striking clinical bipolarity: favorable and unfavor- 
able. High-level expression of NTRK1 predicts favorable 
NB outcome and inhibits growth of unfavorable NB 
cells [31]. PSG1 was reported to an up-regulated gene in 
a fetal liver [32]. PSMG was reported to significantly ele- 
vated expression in HCC [33]. 

Simulation study 

In order to evaluate the proposed methods, we simu- 
lated the two sets of microarray data and then per- 
formed the integrated analysis by using the proposed 
LMe method as well as other methods. For simplicity, 
we assume the log-transformed ratio of two intensities 



are normally distributed. To mimic the liver cancer 
microarray data, we assume that a pair of microarrays 
are obtained from the same patient. The first microarray 
data set consists of 60 microarrays from 30 patients and 
the second data set consists of another 60 microarrays 
from 30 patients. Suppose that two microarrays from 
the same patient are from different groups, say from 
tumor and control tissues. The main objective of the 
analysis is to identify the DEGs between two groups. 

We assume the number of genes in each microarray is 
30 among which the six genes (20%) are truly differen- 
tially expressed: three genes are over-expressed and the 
other three genes are under-expressed in the tumor tis- 
sue. The simulation model is given by 

Y hM = Po + Pt1 x MU + PDl x kikl + ^hikl + E hikl' 

h = l,2, i = 1,2,..., 30, fe = T, C, and I = 1, 2, ... , 30, 

where /3 D i represents a fixed effect of the difference 
between two data sets and ji T i represents a fixed effect 
for difference of expression levels between tumor and 
control tissues. The values of fi T iS are 1.5 for 1=1, 3, 
and -1.5 for / = 4, 6, respectively, and zero for 1 = 7, 
30. The values of floi are randomly determined by 
generating random variables from the standard normal 
distribution. Errors are also generated from the normal 
distribution with mean 0 and variance c 2 = 0.5 2 . 

For the random effect b hiki we assume three types of 
covariance matrix corresponding to Types 1, 2, and 3 
defined in Section Types of Covariance Structure. For 
Type 1, the covariance matrix b a ={b UTb b uc! , b 2 m> 
b 2 ici) T is given by 



e 2 u 


po- n o- 12 


0 


0 


P0- H CT 12 


2 


0 


0 


0 


0 


„2 
CT 21 


po- 21 o- 22 


0 


0 


po- 21 o- 22 


a 22 



We set values of parameters as On = 1, On = 2, <7 2 i = 
1.5, and <7 2 2 = 2.5. In addition, the correlation parameter 
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between tumor and control tissues are set as 0, 0.2, and 
0.4. For Type 2, two variance parameters are set as Oi = 
2.5 and <7 2 = 1, and the correlation parameters are set as 
0, 0.2, and 0.4 as Type 1. Finally, for Type 3, two var- 
iance parameters are set as <7i = 2.5 and <7 2 = 1- F° r the 
detailed information of the covariance structure, see 
Table 4. 

For the simulated data sets, we perform the analyses 
using the meta-analysis, the two-stage ANOVA model 
and five LMe models. We fit this LMe model by assum- 
ing that b,; has the covariance structure of Types 1 to 4. 
These four models are denoted by Ml, M2, M3, and 
M4, respectively. The last LMe model M5 is the one 
assuming no random effects, which is expected to pro- 
vide similar results to the two-stage ANOVA model. 

Table 5 shows powers and FDRs from 1,000 simu- 
lated data sets. The threshold value q was 0.05. Genes 
having ordered q values smaller than 0.05 were identi- 
fied as DEGs. Note that there are 6 true significant 
genes and 24 null genes. The empirical FDR values 
were computed as the number of false significant genes 
from 24 null genes divided by the total number of sig- 
nificant genes. The empirical power was computed as 
the number of significant genes among the 6 true 
genes divided by 6. 

When p was zero, powers and FDRs showed very con- 
sistent results for all methods, although the variances of 
tumor tissue and control tissue are assumed to be differ- 
ent. This means all methods perform similarly when the 
correlation between tumor and control tissues does not 
exist. 



Table 4 Setting for random effects b 



hikl 



Type 



random effects of 
subject 



covariance of random effects 



1 


Ka 
^im 
_ buci _ 




1 2p 0 0 
2p 2 2 0 0 
0 0 1.5 2 3.75p 
0 0 3.75p 2.5 2 


2 


bim 
Ka 
bun 

_ bliCl _ 




2.5 2 2.5p 0 0 
2.5p 1 0 0 
0 0 2.5 2 2.5p 
0 0 2.5p 1 


3 


Kl 
_ b 2il_ 




2.5 2 0 
0 1 





Table 5 summarizes the simulation results for Type 1 
covariance matrix. In general, meta-analysis, two-stage 
ANOVA model analysis, and M5 provided similar 
results in powers and FDRs. On the other hand, other 
LMe models provided quite different results. For exam- 
ple, the FDRs tend to be larger but maintain 5% level 
approximately except for Ml. Powers of LMe models 
tend to be much larger than meta-analysis and two- 
stage ANOVA model analysis. Among the five LMe 
models, Ml and M5 provide distinct results from the 
other three models M2, M3, and M4. 

It is interesting to note that the performance of each 
method depends on the value of p. For meta-analysis, 
two-stage ANOVA, and M5, the powers decrease as p 
increases. On the other hand, the powers of LMe mod- 
els Ml to M4 increase. These tendencies illustrate that 
meta-analysis and two-stage ANOVA do not handle 
correlations efficiently as LMe models do. 

FDRs of LMe models, M2, M3, and M4 are slightly 
larger than 0.05. However, the FDR of Ml is much lar- 
ger than 0.05, especially when p is close to zero. Thus, 
Ml is not appropriate to use when there is no correla- 
tion between tumor and control tissues. 

Table 5 also summarizes the simulation result for the 
Type 2 covariance matrix showing similar patterns with 
those of Type 1 except that the results are less sensitive 
to p. In summary, meta-analysis, ANOVA model analy- 
sis, and M5 provided similar results in powers and 
FDRs. On the other hand, other LMe models provided 
quite different results. Among the five LMe models, Ml 
and M5 provided distinct results from the other three 
LMe models. The powers of LMe models Ml to M4 are 
larger than meta-analysis, ANOVA, and M5. Although 
Ml has the largest power, it also shows the largest FDR. 

Finally, Table 5 also summarizes the simulation result 
for the Type 3 covariance matrix. Though correlation 
parameter p was not considered in this case, the correla- 
tion between tumor tissue and control tissue of same 
patient was assumed by the shared random parameter b hd . 
The results of simulated data under Type 3 are quite dif- 
ferent from those obtained from Types 1 and 2. That is, all 
LMe models, Ml, M2, M3, and M4 show extremely good 
performance. The powers are all 1 and FDRs are well-con- 
trolled around 0.05. LMe models work very well for this 
high correlation case. On the other hand, meta-analysis, 
ANOVA, and M5 performed worse. Among these, meta- 
analysis showed a slightly better performance. It is prob- 
ably due to the fact that the meta-analysis allows different 
variances between two data sets, while others do not. 

Discussion 

The LMe model is much more flexible than meta-analy- 
sis. One of the main limitations of meta-analysis is 
that it cannot handle the sample-specific covariates 
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Table 5 Power and FDR of methods under simulated data of Types 1, 2, and 3 covariance structures when FDR was 
controlled by 0.05 



Type 


P 




Meta analysis 


Two-stage ANOVA 






LMe 
















Ml 


M2 


M3 


M4 


M5 


1 


0 


Power 


0.2087 


0.1983 


0.2770 


0.2073 


0.2610 


0.2273 


0.2240 






FDR 


0.0740 


0.0600 


0.1150 


0.0730 


0.0863 


0.0746 


0.0807 




0.2 


Power 


0.1863 


0.1683 


0.3493 


0.2570 


0.2847 


0.2607 


0.1903 






FDR 


0.0345 


0.0307 


0.0958 


0.0666 


0.0707 


0.0668 


0.0371 




0.4 


Power 


0.1543 


0.1403 


0.4783 


0.3950 


0.4007 


0.3953 


0.1580 






FDR 


0.0170 


0.0186 


0.0718 


0.0558 


0.0573 


0.0557 


0.0104 


2 


0 


Power 


0.1453 


0.1423 


0.3650 


0.3067 


0.3093 


0.3073 


0.1570 






FDR 


0.0224 


0.0251 


0.0920 


0.0564 


0.0569 


0.0563 


0.0248 




0.2 


Power 


0.1290 


0.1347 


0.3373 


0.2867 


0.2867 


0.2867 


0.1450 






FDR 


0.0203 


0.0098 


0.0858 


0.0591 


0.591 


0.0591 


0.0247 




0.4 


Power 


0.1490 


0.1497 


0.3700 


0.3150 


0.3170 


0.3150 


0.1503 






FDR 


0.0283 


0.0323 


0.1091 


0.0680 


0.0676 


0.0680 


0.0363 


3 




Power 


0.1517 


0.0010 


1.000 


1 .0000 


1 .0000 


1 .0000 


0.0043 






FDR 


0.0000 


0.0000 


0.0712 


0.0455 


0.0455 


0.0455 


0.0000 



appropriately. Effect-size is simply the standardized 
mean difference between tumor tissue and control tissue 
[14]. Meta-analysis requires data are homogeneous 
within the data set, although data may be heterogeneous 
across data sets. For example, when there is sex infor- 
mation in data, the effect-size statistic cannot account 
for the sex effect directly. On the other hand, LMe mod- 
els can handle individual specific covariates easily. In 
microarray studies, many researchers want to account 
for the individual characteristics in the analysis by 
including them as controlling variables. For example, 
the covariates such as age, sex, tumor stage, and weight 
might be important controlling variables. These covari- 
ates are usually sample-specific and differ across 
samples. 

When there are no random effects, the LMe models 
become equivalent to the ANOVA models. The hetero- 
geneity among data sets is only represented by the fixed 
effects. When heterogeneity among data sets is small, 
the ANOVA model can easily handle the variability 
among the data sets. However, when data sets have high 
variability and contain the correlated data, the addition 
of only fixed effects may not be satisfactory. In this case, 
the LMe model is more appropriate to analyze data sets, 
because it can model the heterogeneous variance and 
correlation structure more appropriately. The proposed 
LMe model is capable of handling heterogeneous covar- 
iance structures by allowing for various random effects. 

When the data set contains paired and non-paired 
microarrays simultaneously, both meta-analysis and 
ANOVA model approaches cannot handle them appro- 
priately. For example, the meta-analysis and the ANOVA 



analysis treated paired microarrays as independent micro- 
arrays. On the other hand, the proposed LMes can han- 
dle appropriately the correlation between the paired 
microarrays. 

Finally, note that the proposed LMe model is valid when 
the normality assumption holds. We do not expect this 
assumption to hold for real microarray data. However, we 
expect the assumption is decreased when sufficiently large 
number of microarrays were combined. In future studies, 
we will develop permutation tests for the LMe models 
which do not require any distributional assumption. 

Conclusion 

We proposed the LMe model for the integrated analysis 
of microarray data to identify DEGs in the presence of 
many controlling variables. We analyzed the liver cancer 
microarray data set and simulated microarray data to 
evaluate the performance of the integration methods. 
LMe models except Ml maintained FDRs approxi- 
mately. Powers of LMe models except M5 tended to be 
much larger than meta-analysis and two-stage ANOVA 
model analysis. These tendencies illustrated that meta- 
analysis and two-stage ANOVA do not handle correla- 
tions efficiently as LMe models do. 
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